Rdf ingestion mvp #15741
Linear: ING-1308
✅ Meticulous spotted 0 visual differences across 951 screens tested. Meticulous evaluated ~8 hours of user flows against your PR. Last updated for commit 8c9e53d.
Bundle Report: Changes will increase total bundle size by 2.94kB (0.01%) ⬆️. This is within the configured threshold ✅
Affected bundle: datahub-react-web-esm
83de506 to 4ba2ba2
4ba2ba2 to 0238c98
….Class and RDFS.Class
- Updated the logic in `GenericDialect` to exclude ontology construct types while allowing OWL.Class and RDFS.Class to coexist with SKOS.Concept, enhancing compatibility with RDF standards.
- Updated the rdflib dependency in setup.py to specify a version range of >=6.0.0,<7.0.0, ensuring compatibility with existing RDF handling features.
- Updated the rdflib dependency in setup.py to specify an exact version of 6.3.2, ensuring compatibility with existing RDF handling features and preventing potential issues with future releases.
…er, and URN generator
- Introduced new unit tests for various edge cases in RDF dialects (Generic, FIBO, Default), including handling of empty graphs, missing labels, and special characters.
- Added tests for the RDF loader to cover format validation, file handling, URL loading, and zip file scenarios.
- Implemented edge case tests for the URN generator, focusing on IRI parsing and platform normalization.
- Enhanced overall test coverage to ensure robustness and reliability of RDF processing components.
… requests_file
- Modified the RDF plugin dependencies in setup.py to add specific versions of requests (2.32.5) and requests_file (3.0.1) alongside rdflib (6.3.2), ensuring compatibility and stability for RDF processing.
…n in documentation generation
- Added checks and warnings for missing platforms and plugins when processing README and documentation files, ensuring robustness in the documentation generation process.
- Improved logging to provide clearer feedback when encountering issues with platform or plugin names during the generation of custom documentation.
…tion generation
- Added a new metric for tracking missing capability data in the PluginMetrics class.
- Changed error logging to a warning when a plugin is not found in capability data, incrementing the new metric instead of the failed count.
- Updated exit behavior to only return an error code for actual failures, enhancing the robustness of the documentation generation process.
bfdc088 to 545aad0
- Added new capabilities for ABS Data Lake, including support for containers, data profiling, and tags.
- Enhanced Athena source capabilities with additional features such as lineage fine, schema metadata, and test connection.
- Updated platform details and support statuses for both ABS and Athena sources.
- Removed outdated entries and streamlined the capability structure for clarity.
…and URN generator
- Introduced extensive unit tests for the RDF loader, covering URL loading, error handling, and path traversal protection.
- Added tests for the EntityRegistry, focusing on registration methods, CLI name mapping, and processing order.
- Implemented tests for the URN generator, addressing IRI parsing, platform derivation, and structure preservation.
- Enhanced overall test coverage to ensure robustness and reliability of RDF processing components.
…ests
- Updated the test cases in `test_registry_comprehensive.py` to include type hints for the `EntityProcessor` instantiation, improving code clarity and type safety.
- Ensured consistency in the test setup for better maintainability and readability.
- Introduced extensive unit tests for the RDF loader, enhancing coverage for format validation, error handling, and file/URL detection.
- Added comprehensive tests for the URN generator, focusing on platform normalization, IRI parsing, and path derivation edge cases.
- Improved overall test coverage to ensure robustness and reliability of RDF processing components.
… tests
- Updated unit tests in `test_urn_generator_additional_coverage.py` to include type ignores for invalid type checks in platform normalization, IRI path derivation, and group name generation methods.
- Enhanced error handling assertions to maintain clarity while addressing type-related warnings from static analysis tools.
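Several of the commits above test IRI parsing and path derivation in the URN generator. As a rough illustration of what such a step might look like, here is a minimal sketch that turns an IRI into a glossary term URN. The function names `iri_to_path` and `glossary_term_urn`, the dot-joined layout, and the choice to include the host as the first segment are all assumptions made for this example; they are not taken from the PR's `urn_generator.py`.

```python
from typing import List
from urllib.parse import quote, urlparse


def iri_to_path(iri: str) -> List[str]:
    """Split an IRI into host, path segments, and fragment (hypothetical helper)."""
    parsed = urlparse(iri)
    segments = [parsed.netloc] if parsed.netloc else []
    # Drop empty segments produced by leading, trailing, or doubled slashes.
    segments += [s for s in parsed.path.split("/") if s]
    if parsed.fragment:
        segments.append(parsed.fragment)
    return segments


def glossary_term_urn(iri: str) -> str:
    """Build a urn:li:glossaryTerm URN from an IRI (assumed layout, for illustration)."""
    segments = iri_to_path(iri)
    if not segments:
        raise ValueError(f"Cannot derive a URN from IRI: {iri!r}")
    # Percent-encode each segment so spaces, commas, and parentheses
    # cannot leak into the URN.
    encoded = ".".join(quote(s, safe="") for s in segments)
    return f"urn:li:glossaryTerm:{encoded}"


print(glossary_term_urn("http://example.org/finance#Account"))
```

Encoding each segment separately keeps the separator unambiguous for path characters, although a host name such as `example.org` still contains dots of its own; a real implementation would need to decide how to handle that.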
Summary
This PR introduces a new RDF ingestion source for DataHub, enabling ingestion of RDF/OWL ontologies (Turtle, RDF/XML, JSON-LD, N3, N-Triples) with a focus on business glossaries. The source extracts glossary terms, term hierarchies, and relationships from RDF files using standard vocabularies like SKOS, OWL, and RDFS.
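To make the extraction step concrete, the following is a minimal sketch of collecting glossary terms and their hierarchy from SKOS triples. It works on plain `(subject, predicate, object)` string tuples rather than a parsed RDF graph, so it has no third-party dependencies. The function name `extract_terms`, the output dictionary shape, and the choice to fold `skos:narrower` into its inverse `skos:broader` are assumptions for illustration and do not describe how the source in this PR is implemented.

```python
from typing import Dict, Iterable, Tuple

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

Triple = Tuple[str, str, str]


def extract_terms(triples: Iterable[Triple]) -> Dict[str, dict]:
    """Collect skos:Concept subjects with label, definition, and broader links."""
    triples = list(triples)
    # First pass: every subject typed as skos:Concept becomes a term.
    concepts = {s for s, p, o in triples if p == RDF_TYPE and o == SKOS + "Concept"}
    terms: Dict[str, dict] = {
        iri: {"label": None, "definition": None, "broader": set()} for iri in concepts
    }
    # Second pass: attach labels, definitions, and hierarchy links.
    for s, p, o in triples:
        if p == SKOS + "prefLabel" and s in terms:
            terms[s]["label"] = o
        elif p == SKOS + "definition" and s in terms:
            terms[s]["definition"] = o
        elif p == SKOS + "broader" and s in terms:
            terms[s]["broader"].add(o)
        elif p == SKOS + "narrower" and o in terms:
            # skos:narrower is the inverse of skos:broader, so record it on the child.
            terms[o]["broader"].add(s)
    return terms


A = "http://example.org/Account"
B = "http://example.org/SavingsAccount"
example = [
    (A, RDF_TYPE, SKOS + "Concept"),
    (B, RDF_TYPE, SKOS + "Concept"),
    (A, SKOS + "prefLabel", "Account"),
    (B, SKOS + "prefLabel", "Savings Account"),
    (A, SKOS + "narrower", B),
]
print(extract_terms(example)[B]["broader"])
```

Normalizing both predicates to a single direction keeps the later mapping to term relationships simple, since each parent/child pair is stored once regardless of how the ontology author expressed it.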
What's New
Core Features
- (type: rdf) - Native DataHub plugin for RDF/OWL ontologies
- skos:Concept and owl:Class to DataHub GlossaryTerms
- skos:broader and skos:narrower relationships as isRelatedTerms
- stateful_ingestion config
- platform_instance config
Architecture
- test_connection() for connection validation
Capabilities
The source supports the following DataHub capabilities:
- skos:broader and skos:narrower
- stateful_ingestion.enabled: true
- platform_instance config
- (skos:definition or rdfs:comment)
Testing
Test Coverage
- (export_only, skip_export)
Test Files
- tests/unit/rdf/ - Unit tests for individual components
- tests/integration/rdf/ - Integration tests with golden file validation
Documentation
User Documentation
- docs/sources/rdf/rdf.md - Comprehensive user guide (489 lines)
Recipe Examples
- docs/sources/rdf/rdf_recipe.yml - Example recipes for basic and stateful ingestion
Integration Test Documentation
- tests/integration/rdf/README.md - Detailed guide for running integration tests
Configuration Example
source:
  type: rdf
  config:
    source: ./glossary.ttl
    format: turtle
    environment: PROD
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    export_only:
      - glossary
Files Changed
Technical Notes
Security & Performance
Code Quality
New Files
- src/datahub/ingestion/source/rdf/ingestion/rdf_source.py - Main source implementation
- src/datahub/ingestion/source/rdf/core/rdf_loader.py - RDF loading utilities with security
- src/datahub/ingestion/source/rdf/core/urn_generator.py - URN generation with encoding
- src/datahub/ingestion/source/rdf/entities/base.py - Base interfaces for entity processing
- src/datahub/ingestion/source/rdf/entities/registry.py - Thread-safe entity registry
- docs/sources/rdf/rdf.md - User documentation
- docs/sources/rdf/rdf_recipe.yml - Recipe examples
- tests/integration/rdf/test_rdf_source.py - Integration tests
- tests/unit/rdf/ - Unit tests (multiple files)
Modified Files
- setup.py - Added RDF source to entry points (line 862)
Breaking Changes
None - This is a new feature addition with no breaking changes to existing functionality.
Support Status
The RDF source is marked as INCUBATING (SupportStatus.INCUBATING), indicating it's ready for community adoption but may have minor version changes in future releases based on feedback.
Checklist
- setup.py
- (@platform_name, @config_class, @support_status)
- test_connection() implemented